Add fraud-detection example (IEEE-CIS)#140

Open
ZhengyaoJiang wants to merge 7 commits into dev from vk/fraud-detection-example

Conversation

@ZhengyaoJiang
Contributor

Summary

  • Reproducible Weco example on real Vesta payment transactions (IEEE-CIS Fraud Detection Kaggle dataset).
  • Mirrors the published case study (blog, repo): baseline AUC ≈ 0.914, pooled 6-seed mean 0.9305 ± 0.0035 after 200 steps with `gemini-3.1-pro-preview` + the bundled `instructions.md`.
  • Scope: both feature engineering (`build_features`) and model config (`train_and_evaluate`) in `train.py` are optimizable. Weco parses `auc_roc: 0.xxxxxx` from the evaluator.

What's in the example

  • `prepare_data.py` — Kaggle download, label-encode + V-feature correlation pruning, time-based 80/20 split, subsample to 100K/25K parquet files. Uses `python -m kaggle.cli` so the venv's bin/ doesn't need to be on PATH; prints a helpful hint on 403 (rules not accepted / kaggle.json perms).
  • `train.py` — Weco's optimization target. Leakage-safe baseline: drops `isFraud` before any cross-column aggregation.
  • `evaluate.py` — reimports `train.py` each run, prints the metric line.
  • `instructions.md` — the full EDA + techniques prompt from the case study, with a silent-target-leakage guardrail.
  • `README.md` — venv setup (PEP 668 safe), data prep, baseline sanity check, Weco run command, "things to try" ablations, and a pointer to the leakage trap.
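The only hard contract between `evaluate.py` and Weco is the metric line on stdout. A minimal illustrative sketch (not the repo's actual evaluator; the function names and the pure-Python AUC are stand-ins) of that final step:

```python
# Illustrative sketch: the one requirement is the stdout line Weco
# parses, "auc_roc: 0.xxxxxx". Function names here are hypothetical.

def auc_roc(y_true, scores):
    """Rank-based AUC: probability a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def print_metric(y_true, scores):
    # Weco greps this exact "name: value" line from the evaluator's stdout.
    print(f"auc_roc: {auc_roc(y_true, scores):.6f}")
```

In the real example the score would come from LightGBM predictions on the validation parquet; the print format is what matters.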

Verification

Two rounds of fresh-agent testing caught and fixed: the venv prerequisite on modern Python installs; `python3` vs `python` on Ubuntu; and the `kaggle` package having no `__main__`, so the entry point must be `kaggle.cli`. The final sanity check blocked on a `403 Forbidden` from the Kaggle API (accepting the competition rules is a per-user prerequisite, called out in the README).

Test plan

  • Accept competition rules at https://www.kaggle.com/c/ieee-fraud-detection
  • `cd examples/fraud-detection && python3 -m venv .venv && source .venv/bin/activate`
  • `pip install -r requirements.txt`
  • `python prepare_data.py` produces `data/base_train_small.parquet` and `data/base_val_small.parquet`
  • `python evaluate.py` prints `auc_roc: 0.91x`
  • `weco run ...` (full command in README) moves AUC into 0.928–0.933 by step ~30

🤖 Generated with Claude Code

ZhengyaoJiang and others added 2 commits April 23, 2026 15:25
Self-contained reproduction of Weco's fraud-detection case study. Downloads
the Kaggle dataset, builds a leakage-safe 100K/25K time-based parquet split,
and exposes train.py as the optimization target (feature engineering +
LightGBM config both modifiable). evaluate.py prints auc_roc for Weco.

instructions.md is the full EDA + techniques prompt from the case study —
column semantics for each feature group (TransactionAmt, C/D/M/V), 10
well-known IEEE-CIS techniques (UID construction, target encoding with OOF,
velocity features, frequency encoding), and a target-leakage guardrail
pointing out the isFraud-in-df aggregation trap.

README walks through Kaggle API setup, prepare_data step, baseline sanity
check (~0.914 AUC), and the canonical weco run command
(gemini-3.1-pro-preview, 50 steps, expected trajectory into 0.928-0.933).
Also adds 'things to try' (no-instructions variance blow-up, EDA-only
ablation, scope restriction) and a silent-target-leakage watch-out pointing
to the published case study.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er test

Two fresh-agent test rounds surfaced three issues; all fixed:

- kaggle CLI: the `kaggle` package has no __main__, so `python -m kaggle`
  crashes with ModuleNotFoundError. Correct entry point is `kaggle.cli`.
- venv instruction used `python -m venv`, which fails on Debian/Ubuntu
  systems where only `python3` exists (no python-is-python3). Changed to
  `python3 -m venv`. After activation `python` resolves correctly.
- pip-install fails on modern PEP 668 systems without a venv. README now
  leads with the venv setup before the install step, with a note on why.

Also: prepare_data.py now catches Kaggle CalledProcessError and prints
the two most common root causes (rules not accepted / kaggle.json perms)
with the exact URL to accept the competition rules.
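The entry-point fix and the error hint together might be sketched like this (illustrative only; the helper names and exact message wording are not taken from the repo):

```python
import subprocess
import sys

def kaggle_cmd(competition: str) -> list[str]:
    # `python -m kaggle` crashes with ModuleNotFoundError (the package ships
    # no __main__ module); `python -m kaggle.cli` is the working entry point,
    # and sys.executable avoids needing the venv's bin/ on PATH.
    return [sys.executable, "-m", "kaggle.cli",
            "competitions", "download", "-c", competition]

def failure_hint(competition: str) -> str:
    # Printed when the download exits nonzero (typically 403 Forbidden).
    return (
        "Kaggle download failed. The two most common causes:\n"
        f"  1. Competition rules not accepted: https://www.kaggle.com/c/{competition}/rules\n"
        "  2. ~/.kaggle/kaggle.json missing or not chmod 600\n"
    )

def download(competition: str) -> None:
    try:
        subprocess.run(kaggle_cmd(competition), check=True)
    except subprocess.CalledProcessError:
        print(failure_hint(competition))
        raise
```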

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92cb31d6a4


Comment thread on examples/fraud-detection/train.py (outdated):

    y_val = val_df["isFraud"].values.astype(np.int32)

    n_train = len(train_df)
    df = pd.concat([train_df, val_df], axis=0, ignore_index=True)

P1: Fit feature aggregations on training data only

build_features concatenates train_df and val_df before creating grouped amount statistics and frequency encodings, so validation rows (future data in this time-based split) directly shape the engineered features used for evaluation. That leaks validation distribution into the pipeline and can systematically inflate the reported AUC that Weco optimizes against. Compute these encodings/aggregations from train_df only, then map them onto val_df with defaults for unseen keys.
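The fit-on-train / apply-to-both pattern the review asks for can be sketched as follows (a minimal illustration, not the repo's code; the `card1`/`TransactionAmt` columns come from the dataset, the function names are hypothetical):

```python
import pandas as pd

def fit_amount_stats(train_df: pd.DataFrame, key: str = "card1",
                     amt: str = "TransactionAmt"):
    # Fit the aggregation on the training split only, so validation-period
    # rows never shape the statistics they are evaluated with.
    stats = train_df.groupby(key)[amt].agg(["mean", "std"])
    defaults = train_df[amt].agg(["mean", "std"])  # fallback for unseen keys
    return stats, defaults

def apply_amount_stats(df: pd.DataFrame, stats, defaults,
                       key: str = "card1") -> pd.DataFrame:
    # Map the fitted statistics onto any split; keys unseen in training
    # fall back to the train-global defaults.
    out = df.copy()
    out[f"amt_mean_by_{key}"] = out[key].map(stats["mean"]).fillna(defaults["mean"])
    out[f"amt_std_by_{key}"] = out[key].map(stats["std"]).fillna(defaults["std"])
    return out
```

Applied to both splits, the validation rows receive only statistics computed from the past.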


ZhengyaoJiang and others added 5 commits April 23, 2026 16:41
Codex flagged that the baseline concatenates train + val before computing
groupby aggregations and frequency encodings, letting val-period
distribution shape train features and letting each val row influence its
own encoded values. Even with isFraud dropped first, this is time-leakage
that inflates val AUC vs. what would be seen at serving time.

Fix: compute all encoders (card1/addr1 amount stats, frequency encoding)
on train_df only; .join/.map onto both splits; fill unseen val keys with
train-global defaults. Refactored per-row features (time, amount) into a
small helper so both splits share that code path without concat.

Baseline AUC drops from the previously-reported 0.914 to 0.910 — the
right number, not artificially inflated. Expected Weco trajectory (0.928-
0.933 at 200 steps with full instructions) unchanged in shape; case study
absolute numbers used the leaky baseline so they shift slightly here.

Also expanded instructions.md and README to distinguish target leakage
(isFraud in the dataframe during aggregation) from time leakage (val
distribution in the encoder fit), with the fit-on-train / apply-to-both
pattern spelled out for future encoders Weco proposes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous prepare_data.py used pandas df.sample(random_state=42), which
produced parquets with shape, fraud rate, and DT range matching the original
case-study but DIFFERENT row content — baseline AUC came out at 0.9023
instead of the case-study's 0.9102. Recovered the original ad-hoc prep
recipe from a Claude Code session transcript and rewrote to match. Two
recipe details that turned out to matter:

1. Stratified train subsample preserving fraud rate, using a single global
   np.random.seed(42) followed by sequential np.random.choice calls (NOT
   pandas df.sample). The val subsample inherits the advanced RNG state.
2. Label-encode using categories from concat(train, val), and include
   "string" alongside "object" in select_dtypes — pandas 3 uses StringDtype
   for string columns and skips them when only "object" is included,
   silently leaving them as raw strings (which would then crash LightGBM
   or be dropped before fit).
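The shape of the two recipe details above might look like this (a reconstruction under stated assumptions: exact subsample sizes, index handling, and column names are taken from the commit message, not verified against the repo):

```python
import numpy as np
import pandas as pd

def stratified_subsample(train_df, val_df, n_train=100_000, n_val=25_000, seed=42):
    # One global seed, then sequential np.random.choice calls: the val draw
    # deliberately inherits the RNG state advanced by the train draw
    # (pandas df.sample(random_state=...) would not reproduce this stream).
    np.random.seed(seed)
    n_pos = int(round(n_train * train_df["isFraud"].mean()))  # preserve fraud rate
    pos = train_df.index[train_df["isFraud"] == 1].to_numpy()
    neg = train_df.index[train_df["isFraud"] == 0].to_numpy()
    keep = np.concatenate([
        np.random.choice(pos, n_pos, replace=False),
        np.random.choice(neg, n_train - n_pos, replace=False),
    ])
    train_small = train_df.loc[np.sort(keep)]
    val_keep = np.random.choice(val_df.index.to_numpy(), n_val, replace=False)
    return train_small, val_df.loc[np.sort(val_keep)]

def label_encode(train_df, val_df):
    # pandas 3 stores string columns as StringDtype: selecting only "object"
    # silently skips them, so "string" must be included as well.
    for c in train_df.select_dtypes(include=["object", "string"]).columns:
        cats = pd.concat([train_df[c], val_df[c]]).astype("category").cat.categories
        train_df[c] = pd.Categorical(train_df[c], categories=cats).codes
        val_df[c] = pd.Categorical(val_df[c], categories=cats).codes
    return train_df, val_df
```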

Verified locally: re-running this prepare_data.py from a fresh Kaggle
download produces parquets with SHA-256s
  train: a2d7a6740559975b8e6d89bd605f1e29791dd7d3fee8abc6449552bbc18d29ae
  val:   8b426c8bf7fa845bc234dbce304b1107fd295143fac2398bab97b78805f50753
matching the case-study originals exactly. Baseline AUC = 0.910171.

README updated to reflect the now-deterministic 0.9102 baseline (the
previous "0.910 because we removed the leak" gloss was misleading — the
parquets themselves were different from the case-study). Reframed the
0.914 reference as the case-study's leaky-baseline AUC for clarity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recent CLI versions ship important fixes — most relevant here, 0.3.31
added queue-mode submit recovery (`_recover_queue_suggest`) and a
native `AutoResumePolicy` that together make the transient
`Failed to submit result` race invisible to the user. Anyone with an
older weco in their venv (e.g. operators reusing weco-gpu's pinned
0.3.25) was hitting this race and silently terminating runs short of
their step budget.

Switching the install command to `pip install --upgrade -r requirements.txt`
ensures users picking up this example always get the latest fixes,
regardless of what's pre-installed in their venv. Comment in the README
explains why we never pin weco-cli.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…to -loose

The previous fraud-detection example exposed `build_features(train_df, val_df)`
in a single file. The agent could (and frequently did) `pd.concat([train, val])`
and silently introduce time-leakage in encoders. We measured the inflation at
0.001-0.005 AUC depending on parquet contents, and found that prompt-level
"fit on train only" warnings only achieved ~67% compliance across seeds.

The new fraud-detection/ example uses a fit/transform interface:
  features.py: class FeatureBuilder with fit(X_train, y_train) + transform(X)
  model.py:    train_and_evaluate(X_train, y_train, X_val, y_val) -> float
  evaluate.py: frozen orchestrator that strips isFraud, calls fb.fit then
               fb.transform twice, and runs the model.

This shuts three leakage paths at the interface:
- isFraud is dropped before X reaches features.py (target leakage out).
- val data is never visible to fit() (time leakage out).
- transform() has no y argument (val labels can't influence val features).
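The contract can be sketched in a few lines (a minimal illustration of the interface described above; the single `card1` encoder and the `run` orchestrator are assumptions, not the repo's code):

```python
import pandas as pd

class FeatureBuilder:
    # Sketch of the contract: fit() sees only training data and labels,
    # transform() takes no y. The one encoder here is illustrative.
    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "FeatureBuilder":
        self.amt_mean_ = X_train.groupby("card1")["TransactionAmt"].mean()
        self.default_ = X_train["TransactionAmt"].mean()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        out["amt_mean_card1"] = out["card1"].map(self.amt_mean_).fillna(self.default_)
        return out

def run(train_df: pd.DataFrame, val_df: pd.DataFrame):
    # Frozen-orchestrator shape: isFraud is stripped before X reaches the
    # builder, and val data never enters fit().
    y_tr = train_df["isFraud"]
    X_tr = train_df.drop(columns=["isFraud"])
    X_va = val_df.drop(columns=["isFraud"])
    fb = FeatureBuilder().fit(X_tr, y_tr)
    return fb.transform(X_tr), fb.transform(X_va)
```

Any encoder Weco proposes inside `fit` is structurally unable to see validation rows or validation labels.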

Weco optimizes:
- features.py and model.py separately for scope=features / scope=model
- both together (`--sources features.py model.py`) for scope=full
The file boundary IS the scope boundary — no leaky helper module needed.

Existing single-file example renamed to fraud-detection-loose/ and kept as
a comparison artifact. README in fraud-detection/ links to it.

Baseline AUC: 0.909132 (deterministic; ~0.001 below the loose version's
0.910171 — that's the leakage inflation in the loose baseline).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The features.py docstring says y_train is a pd.Series (so users can call
.values, .map, .to_dict on it for OOF target encoding). Earlier evaluate.py
passed the result of .values.astype("int32") which is a numpy ndarray,
breaking any proposal that did `y_train.values` or `y_train.map(...)`.
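The difference in miniature (illustrative toy data, not the repo's evaluator):

```python
import pandas as pd

train_df = pd.DataFrame({"isFraud": [0, 1, 0]})

# Broken: .values.astype(...) hands proposals a numpy ndarray,
# so y_train.map(...) / y_train.to_dict() raise AttributeError.
y_broken = train_df["isFraud"].values.astype("int32")

# Fixed: keep it a pd.Series, as the docstring promises;
# .values, .map, and .to_dict all work on it.
y_train = train_df["isFraud"].astype("int32")
```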

Sanity-checked on a 3-seed Weco run: with the Series fix, proposals proceed
past step 1 instead of crashing on AttributeError.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>